The tourism sector plays a significant role in the Tanzanian economy, contributing about 17% of the country’s GDP and 25% of all foreign exchange revenues. The sector, which employs more than 600,000 people directly and up to 2 million indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa. Tanzania is the only country in the world that has allocated more than 25% of its total area to wildlife, national parks, and protected areas. The country has 16 national parks, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.
The objective of this competition is to explore the data and build a linear regression model that predicts the spending behavior of tourists visiting Tanzania. Tour operators and the Tanzania Tourism Board can use the model to help tourists from around the world estimate their expenditure before visiting Tanzania.
The dataset contains 6476 rows of up-to-date information on tourist expenditure collected by the National Bureau of Statistics (NBS) in Tanzania. It was collected to gain a better understanding of the status of the tourism sector and to provide an instrument that will enable sector growth. The survey covers seven departure points: Julius Nyerere International Airport, Kilimanjaro International Airport, Abeid Amani Karume International Airport, and the Namanga, Tunduma, Mtukula and Manyovu border points.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
from pandas_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
# Loading data
df=pd.read_csv('Train .csv')
df_test=pd.read_csv('Test .csv')
Final_df= df_test.copy()
# Data Preview
df.head()
| | ID | country | age_group | travel_with | total_female | total_male | purpose | main_activity | info_source | tour_arrangement | ... | package_transport_tz | package_sightseeing | package_guided_tour | package_insurance | night_mainland | night_zanzibar | payment_mode | first_trip_tz | most_impressing | total_cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tour_0 | SWIZERLAND | 45-64 | Friends/Relatives | 1.0 | 1.0 | Leisure and Holidays | Wildlife tourism | Friends, relatives | Independent | ... | No | No | No | No | 13.0 | 0.0 | Cash | No | Friendly People | 674602.5 |
| 1 | tour_10 | UNITED KINGDOM | 25-44 | NaN | 1.0 | 0.0 | Leisure and Holidays | Cultural tourism | others | Independent | ... | No | No | No | No | 14.0 | 7.0 | Cash | Yes | Wonderful Country, Landscape, Nature | 3214906.5 |
| 2 | tour_1000 | UNITED KINGDOM | 25-44 | Alone | 0.0 | 1.0 | Visiting Friends and Relatives | Cultural tourism | Friends, relatives | Independent | ... | No | No | No | No | 1.0 | 31.0 | Cash | No | Excellent Experience | 3315000.0 |
| 3 | tour_1002 | UNITED KINGDOM | 25-44 | Spouse | 1.0 | 1.0 | Leisure and Holidays | Wildlife tourism | Travel, agent, tour operator | Package Tour | ... | Yes | Yes | Yes | No | 11.0 | 0.0 | Cash | Yes | Friendly People | 7790250.0 |
| 4 | tour_1004 | CHINA | 1-24 | NaN | 1.0 | 0.0 | Leisure and Holidays | Wildlife tourism | Travel, agent, tour operator | Independent | ... | No | No | No | No | 7.0 | 4.0 | Cash | Yes | No comments | 1657500.0 |
5 rows × 23 columns
df.duplicated().sum()
0
df.isnull().sum()
ID                          0
country                     0
age_group                   0
travel_with              1114
total_female                3
total_male                  5
purpose                     0
main_activity               0
info_source                 0
tour_arrangement            0
package_transport_int       0
package_accomodation        0
package_food                0
package_transport_tz        0
package_sightseeing         0
package_guided_tour         0
package_insurance           0
night_mainland              0
night_zanzibar              0
payment_mode                0
first_trip_tz               0
most_impressing           313
total_cost                  0
dtype: int64
df_test.duplicated().sum()
0
df_test.isnull().sum()
ID                         0
country                    0
age_group                  0
travel_with              327
total_female               1
total_male                 2
purpose                    0
main_activity              0
info_source                0
tour_arrangement           0
package_transport_int      0
package_accomodation       0
package_food               0
package_transport_tz       0
package_sightseeing        0
package_guided_tour        0
package_insurance          0
night_mainland             0
night_zanzibar             0
payment_mode               0
first_trip_tz              0
most_impressing          111
dtype: int64
# For travel_with, fill missing values with the most frequent option, "Alone"
df['travel_with'] = df['travel_with'].fillna('Alone')
# For most_impressing, fill missing values with the most frequent option, "Friendly People"
df['most_impressing'] = df['most_impressing'].fillna('Friendly People')
# Fill the female and male count columns with their respective modes
df['total_female'] = df['total_female'].fillna(df['total_female'].mode()[0])
df['total_male'] = df['total_male'].fillna(df['total_male'].mode()[0])
# Apply the same fills to the test set: travel_with gets "Alone"
df_test['travel_with'] = df_test['travel_with'].fillna('Alone')
# most_impressing gets "Friendly People"
df_test['most_impressing'] = df_test['most_impressing'].fillna('Friendly People')
# Fill the female and male count columns with the respective training-set modes
df_test['total_female'] = df_test['total_female'].fillna(df['total_female'].mode()[0])
df_test['total_male'] = df_test['total_male'].fillna(df['total_male'].mode()[0])
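The imputation above repeats the same fills for the training and test frames. A minimal sketch of one way to keep the two in step is shown below: it learns one fill value per column from the training frame and applies it to both. The helper name `fill_with_train_mode` and the toy frames are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def fill_with_train_mode(train, test, columns):
    """Fill missing values in both frames with modes learned from `train` only."""
    # Learn one fill value per column from the training data.
    fill_values = {col: train[col].mode()[0] for col in columns}
    # DataFrame.fillna accepts a dict mapping column name -> fill value.
    return train.fillna(value=fill_values), test.fillna(value=fill_values), fill_values


# Toy frames standing in for the real train and test sets (hypothetical values).
toy_train = pd.DataFrame({
    "travel_with": ["Alone", "Alone", "Spouse", None],
    "total_male": [1.0, 1.0, 0.0, None],
})
toy_test = pd.DataFrame({
    "travel_with": [None, "Children"],
    "total_male": [None, 2.0],
})

filled_train, filled_test, fill_values = fill_with_train_mode(
    toy_train, toy_test, ["travel_with", "total_male"]
)
print(fill_values)
```

Learning the fill values from the training frame alone means the test data never influences preprocessing, which keeps the later evaluation honest.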
# Descriptive Summary
df.describe()
| | total_female | total_male | night_mainland | night_zanzibar | total_cost |
|---|---|---|---|---|---|
| count | 4809.000000 | 4809.000000 | 4809.000000 | 4809.000000 | 4.809000e+03 |
| mean | 0.926804 | 1.009565 | 8.488043 | 2.304429 | 8.114389e+06 |
| std | 1.287841 | 1.138273 | 10.427624 | 4.227080 | 1.222490e+07 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.900000e+04 |
| 25% | 0.000000 | 1.000000 | 3.000000 | 0.000000 | 8.121750e+05 |
| 50% | 1.000000 | 1.000000 | 6.000000 | 0.000000 | 3.397875e+06 |
| 75% | 1.000000 | 1.000000 | 11.000000 | 4.000000 | 9.945000e+06 |
| max | 49.000000 | 44.000000 | 145.000000 | 61.000000 | 9.953288e+07 |
plt.figure(figsize=(15,7))
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap="magma")  # correlate numeric columns only
plt.show()
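A heatmap shows every pairwise correlation, but for modelling the column of interest is usually the one for the target. The sketch below ranks numeric features by their correlation with `total_cost`. The helper `target_correlations` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def target_correlations(frame, target):
    """Return correlations of each numeric column with `target`, highest first."""
    # Keep numeric columns only; text columns cannot be correlated directly.
    numeric = frame.select_dtypes(include="number")
    # Take the target's column of the correlation matrix and drop its self-correlation.
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Toy data (made-up values) with one text column that must be ignored.
toy = pd.DataFrame({
    "night_mainland": [1, 2, 3, 4, 5],
    "night_zanzibar": [5, 3, 4, 1, 2],
    "country": ["A", "B", "C", "D", "E"],
    "total_cost": [10.0, 20.0, 30.0, 40.0, 50.0],
})
ranked = target_correlations(toy, "total_cost")
print(ranked)
```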
Questions:
What are the top 5 countries by total tourist spending?
Which age group spends the most, and which travel companion category (travel_with) spends the most overall?
Which country has the highest-spending tourists?
What is the average number of nights a tourist spends on the Tanzanian mainland?
What is the average number of nights a tourist spends in Zanzibar?
What is the payment mode most preferred by tourists?
Which aspects of tourism are more profitable and therefore worth investing in?
What is the food most sought after by tourists?
dats=df.groupby(['country'], sort=False)['total_cost'].sum().reset_index()
#dats= df.groupby('country').agg({'total_cost':['sum','count']})
print(dats)
#dats.to_frame()
                      country    total_cost
0                  SWIZERLAND  7.078238e+08
1              UNITED KINGDOM  3.808383e+09
2                       CHINA  4.296282e+08
3                SOUTH AFRICA  2.594805e+09
4    UNITED STATES OF AMERICA  8.890832e+09
..                        ...           ...
100                   URUGUAY  1.657500e+05
101                   MORROCO  1.491750e+06
102                  THAILAND  1.408875e+06
103                   BERMUDA  2.000000e+05
104                   ESTONIA  2.817750e+06

[105 rows x 2 columns]
# To Find the top 5 countries with the highest spending statistic
top_country= dats.nlargest(5,['total_cost'])
top_country
| | country | total_cost |
|---|---|---|
| 4 | UNITED STATES OF AMERICA | 8.890832e+09 |
| 1 | UNITED KINGDOM | 3.808383e+09 |
| 21 | ITALY | 3.762160e+09 |
| 20 | FRANCE | 3.344496e+09 |
| 30 | AUSTRALIA | 2.743132e+09 |
px.bar(top_country, x = 'country', y = 'total_cost', title = 'TOP 5 COUNTRIES WITH THE HIGHEST SPENDING', color_discrete_sequence = ['darkred'])
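Summing `total_cost` per country ranks countries by total spending, which also rewards countries that simply appear in more survey records. The commented-out `agg` call above hints at looking at the count too. The sketch below uses pandas named aggregation to return the sum, the record count, and the mean side by side. The helper `spending_by_country` and the toy values are assumptions for illustration.

```python
import pandas as pd


def spending_by_country(frame):
    """Summarise total_cost per country: total, number of records, and mean per record."""
    return (
        frame.groupby("country")["total_cost"]
        .agg(total="sum", records="count", mean_per_record="mean")
        .sort_values("total", ascending=False)
    )


# Toy data (made-up values): A has more records, B spends more per record.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "total_cost": [100.0, 100.0, 100.0, 250.0],
})
summary = spending_by_country(toy)
print(summary)
```

In the toy data country A leads on total spending while B leads on spending per record, so the two rankings answer different questions.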
from plotnine import ggplot, aes, geom_boxplot, geom_bar, facet_wrap, theme, ggtitle
ggplot(df,aes(x='age_group',y='total_cost'))+ \
geom_boxplot(color='lightskyblue',fill=['c','g','y','r'])+ ggtitle("Age_group Total_cost boxplot")
[Figure: box plot of total_cost by age_group]
The plot above shows that the 25-44 age group spends the most, followed by the 45-64 age group, while the 65+ group spends the least.
# The over all highest spenders by travel with
ggplot(df,aes(x='travel_with',y='total_cost'))+ \
geom_boxplot(colour="green",fill="lightskyblue")+ ggtitle("Travel_with Total_cost boxplot")
[Figure: box plot of total_cost by travel_with]
The plot above shows that tourists travelling with Friends/Relatives spent the most, followed by those travelling with Spouse and children.
#com_dats=df.groupby(['age_group','travel_with'])['total_cost'].sum().nlargest().reset_index()
#print(com_dats)
#ggplot(com_dats,aes(x='age_group',y='total_cost',fill='travel_with'))+ \
#geom_bar(stat= "identity")+ ggtitle("Age_group and Travel_with")
From the descriptive summary earlier in this project, the mean of night_mainland is 8.488043, so a tourist spends about 8 nights on the Tanzanian mainland on average.
From the same descriptive summary, the mean of night_zanzibar is 2.304429, so a tourist spends about 2 nights in Zanzibar on average.
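The descriptive summary also reports a median of 6 nights and a maximum of 145 for night_mainland, which suggests that a few very long stays pull the mean upward. The sketch below uses a short list of made-up stay lengths (not taken from the dataset) to show how a single long stay affects the mean but barely moves the median.

```python
from statistics import mean, median

# Made-up stay lengths in nights, with one very long stay at the end.
nights = [2, 3, 3, 5, 6, 7, 11, 145]

average_nights = mean(nights)
typical_nights = median(nights)

print("mean:", average_nights)
print("median:", typical_nights)
```

When the two figures differ this much, reporting the median alongside the mean gives a fairer picture of a typical visit.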
sns.countplot(x='payment_mode',data=df,hatch='.')
[Figure: count plot of payment_mode]
As the plot above clearly shows, cash is the payment mode most preferred by tourists.
Activity_dats=df.groupby(['main_activity'], sort=False)['total_cost'].sum().reset_index()
#dats= df.groupby('country').agg({'total_cost':['sum','count']})
print(Activity_dats)
              main_activity    total_cost
0          Wildlife tourism  2.393484e+10
1          Cultural tourism  1.432819e+09
2         Mountain climbing  4.359085e+08
3             Beach tourism  7.712958e+09
4        Conference tourism  3.782597e+09
5           Hunting tourism  8.734764e+08
6             Bird watching  1.560128e+08
7                  business  4.712545e+08
8  Diving and Sport Fishing  2.222264e+08
px.bar(Activity_dats, x = 'main_activity', y = 'total_cost', title = 'Main Activity Statistics', color_discrete_sequence = ['green'])
From the analysis above, the most profitable tourism activities in Tanzania, measured by total tourist spending, are “Wildlife tourism”, followed by “Beach tourism”. It is therefore worthwhile to invest in these sectors.
# Loading wrangled data
df.head()
| | ID | country | age_group | travel_with | total_female | total_male | purpose | main_activity | info_source | tour_arrangement | ... | package_transport_tz | package_sightseeing | package_guided_tour | package_insurance | night_mainland | night_zanzibar | payment_mode | first_trip_tz | most_impressing | total_cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tour_0 | SWIZERLAND | 45-64 | Friends/Relatives | 1.0 | 1.0 | Leisure and Holidays | Wildlife tourism | Friends, relatives | Independent | ... | No | No | No | No | 13.0 | 0.0 | Cash | No | Friendly People | 674602.5 |
| 1 | tour_10 | UNITED KINGDOM | 25-44 | Alone | 1.0 | 0.0 | Leisure and Holidays | Cultural tourism | others | Independent | ... | No | No | No | No | 14.0 | 7.0 | Cash | Yes | Wonderful Country, Landscape, Nature | 3214906.5 |
| 2 | tour_1000 | UNITED KINGDOM | 25-44 | Alone | 0.0 | 1.0 | Visiting Friends and Relatives | Cultural tourism | Friends, relatives | Independent | ... | No | No | No | No | 1.0 | 31.0 | Cash | No | Excellent Experience | 3315000.0 |
| 3 | tour_1002 | UNITED KINGDOM | 25-44 | Spouse | 1.0 | 1.0 | Leisure and Holidays | Wildlife tourism | Travel, agent, tour operator | Package Tour | ... | Yes | Yes | Yes | No | 11.0 | 0.0 | Cash | Yes | Friendly People | 7790250.0 |
| 4 | tour_1004 | CHINA | 1-24 | Alone | 1.0 | 0.0 | Leisure and Holidays | Wildlife tourism | Travel, agent, tour operator | Independent | ... | No | No | No | No | 7.0 | 4.0 | Cash | Yes | No comments | 1657500.0 |
5 rows × 23 columns
The numbers of males and females, and the other features that represent counts, should be whole numbers, so we convert them to int to keep the data realistic.
# convert float dtypes to int[total_female,total_male,night_mainland,night_zanzibar]
df["total_female"] = df['total_female'].astype('int')
df["total_male"] = df['total_male'].astype('int')
df["night_mainland"] = df['night_mainland'].astype('int')
df["night_zanzibar"] = df['night_zanzibar'].astype('int')
# Carry out same operation on our Test dataset
# convert float dtypes to int[total_female,total_male,night_mainland,night_zanzibar]
df_test["total_female"] = df_test['total_female'].astype('int')
df_test["total_male"] = df_test['total_male'].astype('int')
df_test["night_mainland"] = df_test['night_mainland'].astype('int')
df_test["night_zanzibar"] = df_test['night_zanzibar'].astype('int')
# Generate new features from some columns
df["total_people"] = df["total_female"] + df["total_male"]
df["total_nights"] = df["night_mainland"] + df["night_zanzibar"]
# Generate new features from some columns on Test Dataset
df_test["total_people"] = df_test["total_female"] + df_test["total_male"]
df_test["total_nights"] = df_test["night_mainland"] + df_test["night_zanzibar"]
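The type conversions and the two derived features are applied twice above, once per frame. A minimal sketch of wrapping them in a single function, so that the training and test frames always receive identical treatment, is given below. The names `COUNT_COLUMNS` and `add_engineered_features` are hypothetical, and the toy frame only stands in for the real data.

```python
import pandas as pd

# Columns that hold counts and should therefore be integers.
COUNT_COLUMNS = ["total_female", "total_male", "night_mainland", "night_zanzibar"]


def add_engineered_features(frame):
    """Return a copy with count columns cast to int and the two derived totals added."""
    # astype with a dict casts only the listed columns and returns a new frame.
    out = frame.astype({col: int for col in COUNT_COLUMNS})
    out["total_people"] = out["total_female"] + out["total_male"]
    out["total_nights"] = out["night_mainland"] + out["night_zanzibar"]
    return out


# Toy frame (made-up values) with float counts, as in the raw data.
toy = pd.DataFrame({
    "total_female": [1.0, 0.0],
    "total_male": [1.0, 1.0],
    "night_mainland": [13.0, 1.0],
    "night_zanzibar": [0.0, 31.0],
})
prepared = add_engineered_features(toy)
print(prepared)
```

With a helper like this, the same call can be applied to both frames, for example `df = add_engineered_features(df)` followed by `df_test = add_engineered_features(df_test)`.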
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID                     4809 non-null   object
 1   country                4809 non-null   object
 2   age_group              4809 non-null   object
 3   travel_with            4809 non-null   object
 4   total_female           4809 non-null   int32
 5   total_male             4809 non-null   int32
 6   purpose                4809 non-null   object
 7   main_activity          4809 non-null   object
 8   info_source            4809 non-null   object
 9   tour_arrangement       4809 non-null   object
 10  package_transport_int  4809 non-null   object
 11  package_accomodation   4809 non-null   object
 12  package_food           4809 non-null   object
 13  package_transport_tz   4809 non-null   object
 14  package_sightseeing    4809 non-null   object
 15  package_guided_tour    4809 non-null   object
 16  package_insurance      4809 non-null   object
 17  night_mainland         4809 non-null   int32
 18  night_zanzibar         4809 non-null   int32
 19  payment_mode           4809 non-null   object
 20  first_trip_tz          4809 non-null   object
 21  most_impressing        4809 non-null   object
 22  total_cost             4809 non-null   float64
 23  total_people           4809 non-null   int32
 24  total_nights           4809 non-null   int32
dtypes: float64(1), int32(6), object(18)
memory usage: 826.7+ KB
# let's remove ID Column
df.drop('ID', axis='columns', inplace=True)
# Then encode objects into numeric
for colname in df.select_dtypes("object"):
    df[colname], _ = df[colname].factorize()
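The loop above encodes only the training frame. If the test frame were factorized on its own, the same label could receive a different integer, because `factorize` numbers labels in order of first appearance. The sketch below shows one possible way to reuse the training mapping for the test frame; the helper `encode_like_train` and the toy frames are assumptions, and labels that never occur in training are mapped to -1.

```python
import pandas as pd


def encode_like_train(train, test, columns):
    """Integer-encode `columns` in both frames using categories learned from `train`."""
    train_enc, test_enc = train.copy(), test.copy()
    for col in columns:
        # codes are the integers for train; uniques are labels in order of appearance.
        codes, uniques = pd.factorize(train[col])
        train_enc[col] = codes
        # Reuse the training categories; labels unseen in training get code -1.
        test_enc[col] = pd.Categorical(test[col], categories=uniques).codes
    return train_enc, test_enc


# Toy frames (made-up values); "PERU" appears only in the test frame.
toy_train = pd.DataFrame({"country": ["CHINA", "KENYA", "CHINA"]})
toy_test = pd.DataFrame({"country": ["KENYA", "CHINA", "PERU"]})

train_enc, test_enc = encode_like_train(toy_train, toy_test, ["country"])
print(train_enc["country"].tolist(), test_enc["country"].tolist())
```

Note that integer codes impose an arbitrary ordering on categories, which a linear model will treat as a numeric scale; one-hot encoding is a common alternative for such models.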
df.columns
Index(['country', 'age_group', 'travel_with', 'total_female', 'total_male',
'purpose', 'main_activity', 'info_source', 'tour_arrangement',
'package_transport_int', 'package_accomodation', 'package_food',
'package_transport_tz', 'package_sightseeing', 'package_guided_tour',
'package_insurance', 'night_mainland', 'night_zanzibar', 'payment_mode',
'first_trip_tz', 'most_impressing', 'total_cost', 'total_people',
'total_nights'],
dtype='object')
df.head()
| | country | age_group | travel_with | total_female | total_male | purpose | main_activity | info_source | tour_arrangement | package_transport_int | ... | package_guided_tour | package_insurance | night_mainland | night_zanzibar | payment_mode | first_trip_tz | most_impressing | total_cost | total_people | total_nights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 674602.5 | 2 | 13 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 14 | 7 | 0 | 1 | 1 | 3214906.5 | 1 | 21 |
| 2 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 31 | 0 | 0 | 2 | 3315000.0 | 1 | 32 |
| 3 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | ... | 1 | 0 | 11 | 0 | 0 | 1 | 0 | 7790250.0 | 2 | 11 |
| 4 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 7 | 4 | 0 | 1 | 3 | 1657500.0 | 1 | 11 |
5 rows × 24 columns
# Splitting dependent and independent features
features_cols = df.drop(columns=["total_cost"])
cols = features_cols.columns
target = df["total_cost"]
df[cols].shape , target.shape
((4809, 23), (4809,))
profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_widgets()
# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(df[cols],target, test_size=0.20, random_state = 2020)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
(3847, 23) (3847,)
(962, 23) (962,)
# %Model initialization & training
model = LinearRegression().fit(X_train,y_train)
# %Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print(model.intercept_)
print(model.coef_)
-1532685.355596223
[ -24916.7993009 248810.43669093 1070681.41948759 607557.37808411
452550.05619344 -314383.20232077 -197859.23624628 212878.24997176
747305.50233752 4694916.68058624 1322140.77706261 616850.95182881
1651749.77660119 2774593.61306472 406779.60576743 223327.50069718
9307.49419598 54591.45769385 2567290.08395854 290067.16241359
-31166.19595691 1060107.43427755 63898.95188983]
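A bare array of coefficients is hard to read because it does not say which feature each number belongs to. The sketch below pairs coefficients with column names and orders them by absolute size. It fits a toy model on synthetic data, so the column names and the values 100 and 500 are illustrative assumptions, not results from the tourism data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features and an exactly linear target (made-up relationship).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "total_nights": rng.integers(1, 15, 50),
    "total_people": rng.integers(1, 5, 50),
})
y = 100.0 * X["total_nights"] + 500.0 * X["total_people"] + 10.0

toy_model = LinearRegression().fit(X, y)

# Label each coefficient with its feature name, then order by absolute size.
coefs = pd.Series(toy_model.coef_, index=X.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

In the notebook the equivalent would be `pd.Series(model.coef_, index=cols)`. Coefficients on factorized categorical columns should be read with care, since their integer codes carry no natural order.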
predictions= model.predict(X_test)
predictions
array([ 8.50050572e+05, 1.24125965e+07, 6.80871403e+06, 1.31156342e+07,
1.27806337e+04, 1.27276763e+07, 2.88692596e+06, 8.93223640e+06,
1.06980007e+07, -1.54734078e+05, 8.25862462e+05, 1.66198670e+07,
1.62435771e+07, 4.82212769e+06, 8.86829431e+06, 5.33223011e+06,
2.07295499e+06, 4.86418098e+06, 1.36601951e+07, 1.02048605e+07,
2.31018377e+06, 1.22895274e+06, 2.53573804e+06, 5.85448915e+05,
-1.29308411e+06, 1.73162990e+07, 1.23436597e+06, 1.29792679e+07,
2.89214445e+07, 1.23593751e+06, 3.68618178e+05, -9.46148794e+05,
1.05053299e+06, 3.71959909e+06, 6.60322393e+06, -1.84693771e+06,
1.21401216e+07, 9.32888116e+05, 1.46410289e+07, -5.51387146e+05,
3.33073545e+07, 2.61016229e+06, 1.70527451e+07, 1.12436216e+07,
1.80653126e+07, 2.44474829e+06, 2.25303930e+06, -1.85969483e+06,
1.26119237e+07, 5.25052396e+06, 1.46497417e+07, 3.87349083e+05,
4.06874294e+06, 6.62582924e+06, 1.65203097e+06, 1.38778393e+07,
2.02151122e+07, 1.72334458e+07, 8.63087807e+06, 1.28541823e+07,
5.23954584e+06, 1.49006141e+07, 1.48648580e+07, 1.64095330e+07,
3.48333320e+06, 5.55554654e+06, 1.25078753e+07, 3.95041363e+06,
-1.28183943e+05, 1.13486846e+07, 2.21203922e+06, 1.41998892e+07,
8.91874550e+06, 1.34514731e+07, 2.40439420e+06, 9.85455539e+04,
1.31282299e+07, -1.46139902e+04, 1.26317155e+07, 1.05723909e+07,
3.09082769e+06, 1.84580910e+07, 3.30965016e+06, 2.94732364e+05,
1.11353212e+07, 9.92121154e+06, 1.24499721e+07, 1.19765105e+07,
1.79383443e+06, 9.68219503e+05, 9.04215757e+06, 5.05279263e+05,
1.27666414e+07, 1.97633976e+07, 1.67383199e+07, 8.56021731e+06,
5.03024236e+06, 1.22105648e+07, 1.27466049e+06, 1.52987476e+07,
1.79976917e+07, -1.81857369e+06, 1.46274223e+07, 2.10350194e+07,
1.12253018e+07, 1.47026312e+07, 8.99281000e+06, 2.11563640e+06,
3.14142637e+05, 9.56759853e+05, 6.70344267e+06, 5.14414365e+05,
3.31928820e+06, 2.72090325e+06, 2.31941405e+07, 2.23329819e+07,
4.39788609e+06, 6.47168009e+06, 9.31788528e+06, 4.90442968e+06,
1.35592363e+06, 1.98709393e+07, 3.17091678e+06, 1.89448771e+07,
-1.07564003e+05, 1.54137367e+07, 8.88306681e+04, 5.34648821e+06,
1.98717739e+07, -1.59059225e+06, 4.67945678e+06, 2.77703156e+06,
1.75333897e+06, 9.47472104e+05, 1.46888689e+06, 1.78820778e+07,
1.86461621e+07, 9.64175666e+06, -1.61257574e+06, 1.39175803e+07,
1.03707431e+07, 1.69675995e+07, 1.62818934e+07, 4.72706853e+05,
6.14238838e+06, 2.43404527e+07, 2.46790367e+06, 1.32872109e+07,
1.14822189e+06, 7.49448525e+05, 7.02932269e+05, 2.03315308e+06,
1.01114099e+07, 2.16582120e+07, 1.22340456e+06, 1.56015778e+07,
-3.58789486e+04, 3.67414833e+06, 2.12613344e+07, 7.93946091e+06,
2.29104803e+07, 7.72816388e+06, 1.94798726e+07, 1.39618436e+06,
6.14505486e+06, 5.88734773e+06, 1.51400234e+07, 1.49106335e+06,
2.06342153e+06, 1.33188000e+06, 3.98505919e+06, 4.35125206e+06,
2.19186064e+07, 1.05642610e+07, 1.60518678e+07, 4.65151761e+05,
7.34052611e+05, 5.26422208e+05, 4.96787260e+06, 1.32709909e+07,
1.71764936e+07, 1.35964782e+07, 1.05272846e+07, 2.12592597e+07,
6.56551628e+06, 2.02946051e+07, 1.68574151e+07, 2.93700354e+06,
6.65185959e+04, 7.32042401e+06, 5.22368498e+06, 5.72933220e+06,
1.33921441e+07, 5.62602565e+05, 1.35634309e+07, 1.62155819e+06,
1.75304673e+06, 1.02558299e+07, 6.32121022e+06, 1.12092501e+07,
1.10249104e+06, 5.61685409e+06, 1.50117395e+06, 3.62227926e+06,
8.08892963e+05, 7.89017824e+06, 7.70156967e+06, 1.52973132e+06,
3.31785801e+05, 1.60154707e+07, 7.75091740e+06, 2.67678797e+06,
4.23401521e+06, 2.13611447e+06, 9.49084722e+06, 1.21256810e+07,
9.10332175e+06, 2.09220053e+06, 2.43284346e+06, -3.90531918e+05,
3.25847557e+06, 1.64132214e+07, 1.19401934e+07, 1.99966176e+07,
4.22617178e+06, 1.34860136e+07, 5.97788482e+05, 6.48579514e+06,
9.66940310e+06, 1.40718057e+07, 1.02882450e+07, 2.10520995e+07,
3.22355613e+06, 6.53195151e+06, 4.07115220e+06, 1.05902769e+07,
1.48134053e+07, 1.88398280e+07, 2.74700635e+06, 5.16564935e+05,
6.61744573e+06, 2.17717955e+06, 1.09626125e+07, 1.80522501e+07,
1.70592125e+07, 1.38890534e+07, 4.82911104e+06, 1.40738132e+07,
4.41766901e+06, 9.88677264e+06, 1.61912304e+06, 1.57597256e+07,
4.26729073e+06, 3.87856772e+06, 1.61004019e+06, 6.55558905e+06,
3.55779761e+06, 1.34675663e+07, 3.09139127e+06, 4.39545829e+06,
9.57103642e+06, 2.60293964e+05, 1.28317378e+06, 9.96147587e+05,
8.94555950e+05, 1.35978627e+07, -1.18863271e+06, 1.66821085e+07,
3.52112712e+06, 2.69727313e+06, -1.08181357e+06, 6.38878002e+06,
1.77157099e+07, 5.01339306e+05, 1.03527953e+07, 9.50168614e+06,
1.44377919e+07, 3.15256041e+06, 1.87387744e+06, 1.58434663e+07,
-9.78700913e+05, 9.57338892e+06, 2.06116939e+07, 7.98378599e+06,
1.33209038e+06, 9.44173267e+06, 2.09950019e+06, 8.67961848e+06,
6.17752899e+06, 1.61040098e+07, -1.17766316e+05, 1.47230461e+07,
9.91086775e+05, 9.39197054e+06, 1.57116027e+06, 4.55817491e+06,
1.04084726e+07, 5.33926416e+05, 1.63057895e+05, 1.58147895e+07,
-3.28704733e+05, -5.48819215e+05, 1.46003033e+07, 1.83631328e+05,
1.71234287e+07, 9.97720110e+06, 3.14142637e+05, 6.92470554e+06,
1.20354208e+07, 3.72463808e+06, 2.64757278e+06, 3.33161768e+06,
3.87468384e+06, 5.03569718e+06, 2.92427622e+06, 6.03652521e+06,
2.07709163e+06, 6.88715928e+06, 2.77176407e+06, 2.26249830e+06,
5.37529519e+05, 1.43249684e+07, 9.40038371e+06, -2.70274577e+05,
2.08116919e+07, 1.88977553e+07, 1.75297125e+07, 1.14975468e+06,
6.73565786e+06, 1.71079180e+06, 6.12487988e+06, 5.88004769e+06,
5.92301193e+05, -1.55024334e+06, 9.29603955e+06, 1.61983057e+06,
1.80081428e+06, 2.74370362e+06, 1.63892610e+07, -2.37501293e+06,
1.96238016e+07, 1.25799890e+07, 7.96904573e+05, 1.84825116e+07,
5.06914056e+06, 1.41494464e+07, -8.66211440e+05, 1.46487025e+07,
1.43028104e+06, 1.18294684e+07, 1.02857491e+07, -1.30143291e+06,
-1.93853288e+06, 1.82982283e+07, 2.61901301e+06, 3.40245832e+05,
-5.62237825e+05, 1.45289604e+07, 2.55135948e+06, 2.97002152e+05,
2.46718028e+07, 5.39227584e+06, 1.42974563e+07, 1.22126896e+07,
-1.32028006e+05, 1.78626849e+06, -4.55503492e+05, 9.11037222e+06,
1.51493866e+07, 3.16571587e+06, 1.12033922e+07, 3.34268216e+06,
1.33490759e+07, 1.04861602e+07, 1.69811037e+07, 1.14172660e+07,
1.20051231e+06, 4.84208387e+05, 3.30300259e+06, 1.17439336e+07,
3.48044795e+06, 1.06126623e+07, -5.49774969e+04, 3.52668106e+05,
1.50560362e+08, -5.80389680e+05, 9.77605021e+06, 3.41580390e+06,
1.04542689e+07, -2.35803082e+06, 1.40046423e+07, 1.00841316e+07,
1.05167279e+07, 6.38669093e+06, 8.23478629e+05, 4.76776386e+06,
1.40550995e+06, 2.16070153e+06, 5.61452994e+06, 7.07208050e+06,
6.82151150e+06, 2.19076222e+06, 5.03199415e+06, 1.55332417e+06,
2.96187924e+06, 1.73337084e+07, -1.26289633e+05, 6.87458017e+05,
1.85438885e+07, 1.25356843e+07, 7.91361806e+06, 2.78366919e+07,
4.26132812e+06, 1.02690082e+06, 7.92405837e+06, -2.01263389e+06,
4.07654376e+06, -4.06282912e+06, 7.93061861e+06, 2.29813333e+07,
8.45246872e+06, -8.21633935e+05, 1.66471246e+07, 1.31928604e+07,
1.37161001e+07, 1.14875740e+07, 4.11614387e+06, 4.36793558e+06,
1.37845184e+07, -6.74725899e+05, 1.39399979e+07, 1.31374824e+07,
1.62979492e+07, 1.08446947e+06, 9.73244930e+06, 2.45929484e+06,
1.71645967e+07, 6.36111785e+06, 1.77598203e+07, 1.92945215e+06,
4.16872814e+06, 6.93691478e+04, 5.36866018e+06, 8.27200339e+06,
8.58777790e+05, 2.33052866e+06, 2.50641062e+07, 2.82297859e+06,
1.13649723e+07, 1.07523256e+07, 7.29965591e+06, 1.64797726e+07,
1.09213744e+05, 5.38570995e+06, 1.77317781e+07, 6.13089752e+05,
4.55731790e+06, 3.53562786e+06, 3.48359599e+06, 2.06355746e+07,
1.26566070e+07, 2.00385303e+06, 5.77418739e+06, 4.80982168e+06,
1.01320071e+07, 3.07390831e+05, 7.17515330e+05, 7.82292093e+05,
2.15050535e+07, 1.61245521e+07, 1.46775352e+06, 1.00812037e+07,
3.16491698e+07, 2.96761759e+07, 1.30849549e+07, 9.27868318e+06,
2.35584039e+07, 1.44980839e+07, -2.69799919e+04, 2.30362497e+06,
-6.33027844e+05, 9.42444954e+06, 1.58368646e+07, 9.54339805e+06,
1.87606726e+07, 8.90360204e+06, 2.17896018e+06, 9.77133836e+06,
2.44848660e+07, 1.70093455e+07, 2.78444916e+05, 1.40295462e+07,
7.00229989e+06, -3.44568003e+05, 1.03316865e+07, 2.34860198e+06,
1.22520199e+07, 1.88674723e+07, 9.19592910e+06, 3.08322199e+06,
2.20967071e+06, 1.20788996e+07, 1.50843796e+07, 3.12026780e+07,
-8.47705104e+05, 3.44121341e+06, 8.36266088e+06, 8.21288514e+05,
1.38200363e+07, 2.80217643e+06, 1.79488970e+07, 1.54940391e+06,
8.13298245e+06, 1.06862904e+07, 6.42552275e+06, 1.87299433e+07,
1.79077592e+07, 1.58031147e+06, -5.45342170e+04, 2.09320552e+07,
1.43713360e+07, 1.66242222e+07, 1.77073134e+07, 1.36187719e+07,
-8.10806917e+04, 4.26761816e+06, 6.99539826e+06, 8.51939689e+06,
1.70342511e+07, 4.06009350e+06, 7.14278200e+05, 9.33094255e+06,
3.92652676e+06, 8.76529273e+06, -7.18278996e+05, 1.24134123e+07,
1.04712628e+07, 1.69588122e+07, -3.91936538e+05, 1.42350592e+07,
8.82048092e+06, 9.72987065e+06, 3.08767756e+06, 1.82902313e+06,
1.02648088e+06, 9.18249034e+05, 1.82289491e+04, 2.78820412e+07,
-7.96656835e+05, -1.97286777e+06, 9.95423131e+05, 1.48319472e+05,
1.77218058e+06, 2.25783286e+04, 4.54150358e+05, 2.47739437e+06,
7.56275245e+06, 2.17896018e+06, 2.54172365e+06, 1.46692017e+07,
7.00460088e+06, 1.17201561e+07, -1.67156776e+05, 6.77065736e+05,
3.72104496e+06, 1.20026575e+07, 1.19417663e+07, 2.11827859e+07,
3.04749831e+06, 5.58067401e+05, 1.31832972e+07, 7.71549827e+05,
1.95362315e+07, 1.01229226e+07, 9.69142560e+05, 1.83430645e+07,
1.23369775e+07, 1.66185608e+07, 9.90152793e+06, 2.17028775e+07,
1.47831956e+06, 1.60454750e+07, -5.36285443e+04, 9.49230258e+04,
1.54635542e+07, 1.08502690e+07, 2.44751025e+07, 6.77065736e+05,
-2.06718013e+06, 2.70701513e+06, 1.18838572e+07, 5.57012916e+05,
5.43968999e+06, 2.45086500e+06, 1.27103677e+07, 1.57247980e+07,
6.06753499e+05, 3.11248328e+07, 1.26424215e+06, 1.63432987e+06,
2.77117580e+06, 3.77390955e+06, 2.36809683e+07, 1.64021289e+07,
-1.09081642e+06, 1.29668701e+06, 3.11173759e+06, 1.29203521e+06,
1.58176796e+07, 4.21554694e+06, 1.25324364e+07, 1.11633969e+07,
5.85814781e+06, 1.78716861e+07, 1.76055439e+06, 5.89250904e+06,
3.78333867e+05, 8.74287983e+05, 1.80653126e+07, -1.26018155e+05,
6.49671895e+06, -1.20093253e+05, 1.10606211e+07, 3.69302440e+06,
1.47992828e+07, 1.93554270e+07, 1.37966231e+07, 6.97557543e+05,
2.40790672e+06, -7.43730537e+05, 1.02558299e+07, 7.19535684e+06,
1.45685666e+04, 1.18628887e+06, -5.74321650e+05, 2.53454995e+06,
4.84312874e+05, 3.89210429e+06, 1.40698078e+07, 9.43439037e+06,
1.76867666e+06, -2.17844436e+06, 1.40352463e+06, 1.99509631e+07,
4.03541530e+06, 1.72184746e+07, 4.78577594e+06, 1.95795653e+07,
6.62760264e+06, 9.58358982e+06, 4.95173722e+06, 6.80785188e+06,
1.01088467e+07, -8.56312228e+05, 5.47940416e+06, 1.13862457e+07,
1.24478044e+07, 5.26828006e+06, 1.14584967e+07, 3.94117046e+06,
2.16466030e+06, 4.34574291e+06, 3.44734609e+06, 1.13476566e+07,
-1.92688290e+04, 1.32609619e+07, 6.07841941e+06, 4.08268252e+06,
6.43081242e+06, 4.10013669e+06, 3.16644905e+05, 1.17344761e+06,
2.34753584e+07, 1.45707400e+07, 3.54180032e+05, 2.09813934e+06,
3.57527465e+06, 4.16307090e+06, 1.22188913e+07, 2.69958835e+07,
-1.04427368e+06, 1.02558299e+07, 1.71220541e+07, 4.72659147e+06,
2.61359900e+06, 4.30888822e+06, 1.10873925e+07, -6.46895223e+05,
1.03687616e+07, 1.75106013e+06, 2.86715637e+06, 5.00882220e+05,
3.58891561e+06, 9.86826491e+06, 1.14270943e+07, -1.48476085e+05,
5.37683765e+05, 6.26092342e+06, 2.47399086e+06, 2.66281259e+06,
7.55673351e+06, 2.28650973e+06, 7.06572672e+06, 9.27866151e+06,
2.94063193e+06, 2.84523172e+06, 1.70712970e+07, -9.63902425e+04,
1.68860401e+07, -2.35788948e+06, 5.34582704e+05, 1.51331575e+07,
2.21203922e+06, 8.74476287e+06, 1.18530267e+06, 1.22795873e+07,
-1.96621296e+06, 1.10879115e+07, 1.81034752e+07, -7.16766150e+05,
1.23190564e+07, 1.39291814e+07, 1.38340261e+07, 6.55111952e+06,
5.15299536e+06, 2.47848292e+07, 2.45123129e+07, 1.05923401e+07,
1.50500601e+07, 1.48358095e+07, 1.76149183e+07, 6.50121339e+05,
1.18292958e+05, 1.37516180e+07, 4.40137665e+06, 5.74620318e+06,
1.34436973e+07, 1.09028719e+07, 8.51077264e+06, 3.05750403e+07,
3.09451534e+06, 1.12145624e+07, 3.89020644e+06, 1.75796211e+07,
8.50898765e+05, 1.11346257e+07, 4.37414524e+06, 9.63343172e+06,
1.54074627e+07, 2.38710388e+06, 1.64376937e+07, 1.33908448e+07,
3.11535995e+06, -5.49774969e+04, 6.56283217e+06, 1.14385860e+07,
2.25878838e+07, 6.48895522e+06, 1.65169482e+07, -1.32202958e+06,
1.66440366e+06, 1.89956127e+06, 2.32675801e+06, 3.86590635e+06,
1.45722740e+06, 8.68480831e+05, 2.15233404e+06, 2.06664279e+05,
2.32237899e+06, 7.46726193e+06, 6.86135752e+05, 3.02530499e+06,
1.06444658e+07, 5.11870330e+06, 1.47293256e+07, 2.07276913e+07,
3.08512270e+06, 5.19340782e+06, 1.55771158e+07, 2.35089665e+05,
1.94487417e+07, 1.83154980e+07, 4.47304608e+06, 2.76544838e+06,
1.41440876e+06, -1.97888010e+04, 2.10229726e+07, -3.76994380e+05,
7.03632542e+06, 2.06689872e+07, 1.45377109e+07, 3.57211915e+06,
2.23855725e+07, 1.66796547e+07, 8.10318118e+05, 1.13048237e+07,
2.64679967e+07, 2.08475787e+07, 3.26935894e+06, 9.40474916e+06,
-1.10738959e+06, 1.71573946e+07, -3.41364592e+05, 1.74740582e+07,
2.88037361e+06, 7.69148311e+06, 3.14289305e+06, 4.39363555e+06,
1.65787749e+06, 2.44047317e+06, 1.78648733e+06, 1.81872093e+06,
1.00980317e+07, 1.05717421e+07, 4.75148880e+06, 1.67922450e+07,
1.37694633e+07, 6.23605796e+06, 1.38615845e+07, 1.25871427e+07,
6.72684567e+04, 2.48948759e+06, 1.41788958e+07, 1.12390619e+07,
4.70208384e+06, -3.04283077e+05, 1.15705881e+05, 1.33298658e+07,
1.57025243e+07, 1.01119411e+07, 7.83493802e+06, 4.10262120e+06,
-2.03389750e+06, -2.06937187e+05, 6.36582149e+06, 1.79733241e+07,
3.38638650e+05, 1.37325193e+07, 2.34210552e+07, -1.35315196e+05,
2.12097794e+06, 1.46817159e+07, -4.63167886e+05, 2.68108404e+06,
1.32503895e+07, 2.06378210e+07, 6.73191506e+06, 4.79340810e+06,
5.11593080e+05, 2.12716455e+06, -5.42439429e+05, 1.13119963e+07,
5.71486589e+06, 9.91519290e+06, 3.78801765e+06, -7.75099417e+05,
6.19710996e+06, 3.07093812e+06, 1.26415044e+07, 1.74533569e+07,
9.14632121e+05, 1.95447634e+06, -1.85545606e+05, 1.73191169e+06,
3.92527111e+05, 2.36953922e+07, 2.61755534e+05, 1.47799797e+07,
1.59060730e+07, 1.41899427e+07, 4.50361442e+06, 9.52062888e+06,
1.61969976e+07, 1.37348241e+07, 7.57406567e+06, 2.61653061e+06,
2.10551437e+07, 2.10081007e+07, 1.67652221e+07, 1.43243543e+07,
3.27612658e+05, 1.27957887e+07, 1.29086669e+07, 2.69908764e+06,
5.04470753e+06, 1.01923424e+07, 1.58046860e+07, 1.76968666e+07,
-2.24077675e+06, 2.93835799e+06, 6.18679127e+06, 2.62271183e+06,
2.01379546e+07, 1.04022428e+07, 1.26866361e+07, 8.93211356e+06,
1.71215449e+07, 2.27694709e+06, 1.93314383e+06, 1.49233987e+07,
1.48342346e+07, 1.05264741e+07, 1.72517121e+07, 1.05470942e+05,
1.44377919e+07, 2.77063898e+06, 2.57593754e+07, 5.71993416e+06,
1.86415300e+07, 9.19213000e+06, 5.58350332e+06, 3.15517890e+06,
1.40658470e+07, 2.42436594e+07, 2.94992571e+05, 4.54575265e+06,
1.17747222e+07, 1.48252127e+06, 9.87038975e+06, 1.31325332e+07,
-9.29810403e+05, 1.05284319e+07, 1.06658895e+07, 6.73433779e+05,
3.70152043e+05, 1.01864755e+04, 6.41159582e+06, 1.38259003e+07,
3.09169876e+06, 2.20232733e+07, 8.15853461e+06, 1.14385248e+06,
1.27058165e+06, 8.70389979e+06, 1.25984606e+07, 2.07727381e+07,
9.60320168e+06, 1.01163465e+07, 1.49692363e+07, 4.37504766e+05,
1.10151303e+07, 1.13000086e+07, 3.88997067e+06, 8.16158014e+05,
1.71609537e+07, 1.36497458e+07, 7.96703761e+06, 2.68665635e+06,
1.00703646e+06, 1.35512498e+06, -7.91689504e+05, 6.91234557e+05,
4.27194462e+06, 1.42940746e+07, 2.67379377e+06, 1.08065899e+07,
2.01796812e+06, 4.38913634e+06, 1.74291055e+07, 9.11175291e+06,
8.57134412e+06, -6.46812071e+05])
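Several of the predicted values in the array above are negative, which cannot correspond to a real expenditure. One possible post-processing step, shown here only as a sketch on a few values copied from that array (the names `y_pred_sample` and `y_pred_clipped` are introduced for illustration), is to clip predictions at zero:

```python
import numpy as np

# A few predictions copied from the array above, including negative ones
y_pred_sample = np.array([1.41440876e+06, -1.97888010e+04,
                          2.10229726e+07, -3.76994380e+05])

# Clip at zero so that no predicted expenditure is negative
y_pred_clipped = np.clip(y_pred_sample, 0, None)
print(y_pred_clipped)
```

Clipping can only move a negative prediction closer to a non-negative target, so it should not make the MAE worse when all actual costs are non-negative.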
# Evaluate with Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_test_pred)
print('Using scikit-learn, the MAE is {}'.format(mae))
Using scikit-learn, the MAE is 5749113.958510288
# Evaluate with Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_test_pred)
print('Using scikit-learn, the MSE is {}'.format(mse))
Using scikit-learn, the MSE is 118732536896093.78
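Because the MSE is expressed in squared currency units, it is hard to read next to the MAE. A minimal sketch, using only the MSE value printed above, takes the square root to obtain the RMSE in the same units as the target (the variable name `rmse` is introduced here for illustration):

```python
import math

# MSE reported above for the Linear Regression model (squared currency units)
mse = 118732536896093.78

# RMSE is the square root of the MSE, in the same units as the target
rmse = math.sqrt(mse)
print('The RMSE is {:.0f}'.format(rmse))
```

The RMSE comes out at roughly 10.9 million, nearly twice the MAE, which suggests that a small number of large errors dominate the squared loss.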
# Let's try Extreme Gradient Boosting (XGBoost)
from xgboost import XGBRegressor
As the results below show, XGBRegressor outperforms Linear Regression even with its default parameters.
# Instantiate the model
XGB = XGBRegressor()
# Train the model
XGB.fit(X_train, y_train)
# Predict
y_train_pred = XGB.predict(X_train)
y_test_pred = XGB.predict(X_test)
# Evaluate
mae = mean_absolute_error(y_test, y_test_pred)
print('Using scikit-learn, the MAE is {}'.format(mae))
mse = mean_squared_error(y_test, y_test_pred)
print('Using scikit-learn, the MSE is {}'.format(mse))
Using scikit-learn, the MAE is 5391249.288207617
Using scikit-learn, the MSE is 107951570443229.86
# Performance of XGBoost with hand-set hyperparameters
X_train, X_test, y_train, y_test = train_test_split(df[cols], target, test_size=0.20, random_state=2020)
# Instantiate the model
XGB_par = XGBRegressor(n_estimators=100, colsample_bynode=0.8, learning_rate=0.02, max_depth=7)
# Train the model
XGB_par.fit(X_train, y_train)
# Predict
y_train_pred = XGB_par.predict(X_train)
y_test_pred = XGB_par.predict(X_test)
# Evaluate
mae = mean_absolute_error(y_test, y_test_pred)
print('Using scikit-learn, the MAE is {}'.format(mae))
mse = mean_squared_error(y_test, y_test_pred)
print('Using scikit-learn, the MSE is {}'.format(mse))
Using scikit-learn, the MAE is 4800634.196620062
Using scikit-learn, the MSE is 97851770809151.97
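To put the three sets of scores side by side, the sketch below copies the MAE and MSE values printed above and reports how much lower each one is than the Linear Regression baseline. The `results` dictionary and the `relative_drop` helper are names introduced here for illustration. The comparison is only like-for-like if all three models were evaluated on the same train/test split, which the cells above do not confirm for the parameterised model, since it is fitted after a fresh `train_test_split` call.

```python
# (MAE, MSE) values copied from the outputs printed above
results = {
    'LinearRegression': (5749113.958510288, 118732536896093.78),
    'XGBRegressor (default)': (5391249.288207617, 107951570443229.86),
    'XGBRegressor (with parameters)': (4800634.196620062, 97851770809151.97),
}

base_mae, base_mse = results['LinearRegression']


def relative_drop(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


for name, (mae, mse) in results.items():
    print('{:<32} MAE {:>10.0f} ({:4.1f}% lower)  MSE {:.3e} ({:4.1f}% lower)'.format(
        name, mae, relative_drop(base_mae, mae), mse, relative_drop(base_mse, mse)))
```

On these numbers, the parameterised XGBRegressor reduces the MAE by about 16.5% relative to Linear Regression, against about 6.2% for the default XGBRegressor.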
df_test.head()
| | ID | country | age_group | travel_with | total_female | total_male | purpose | main_activity | info_source | tour_arrangement | ... | package_sightseeing | package_guided_tour | package_insurance | night_mainland | night_zanzibar | payment_mode | first_trip_tz | most_impressing | total_people | total_nights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tour_1 | AUSTRALIA | 45-64 | Spouse | 1 | 1 | Leisure and Holidays | Wildlife tourism | Travel, agent, tour operator | Package Tour | ... | Yes | Yes | Yes | 10 | 3 | Cash | Yes | Wildlife | 2 | 13 |
| 1 | tour_100 | SOUTH AFRICA | 25-44 | Friends/Relatives | 0 | 4 | Business | Wildlife tourism | Tanzania Mission Abroad | Package Tour | ... | No | No | No | 13 | 0 | Cash | No | Wonderful Country, Landscape, Nature | 4 | 13 |
| 2 | tour_1001 | GERMANY | 25-44 | Friends/Relatives | 3 | 0 | Leisure and Holidays | Beach tourism | Friends, relatives | Independent | ... | No | No | No | 7 | 14 | Cash | No | No comments | 3 | 21 |
| 3 | tour_1006 | CANADA | 24-Jan | Friends/Relatives | 2 | 0 | Leisure and Holidays | Cultural tourism | others | Independent | ... | No | No | No | 0 | 4 | Cash | Yes | Friendly People | 2 | 4 |
| 4 | tour_1009 | UNITED KINGDOM | 45-64 | Friends/Relatives | 2 | 2 | Leisure and Holidays | Wildlife tourism | Friends, relatives | Package Tour | ... | No | No | No | 10 | 0 | Cash | Yes | Friendly People | 4 | 10 |
5 rows × 24 columns
# Remove the ID column
df_test.drop('ID', axis='columns', inplace=True)
# Then encode object columns as integers
for colname in df_test.select_dtypes("object"):
    df_test[colname], _ = df_test[colname].factorize()
df_test.head()
| | country | age_group | travel_with | total_female | total_male | purpose | main_activity | info_source | tour_arrangement | package_transport_int | ... | package_sightseeing | package_guided_tour | package_insurance | night_mainland | night_zanzibar | payment_mode | first_trip_tz | most_impressing | total_people | total_nights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 10 | 3 | 0 | 0 | 0 | 2 | 13 |
| 1 | 1 | 1 | 1 | 0 | 4 | 1 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 1 | 13 | 0 | 0 | 1 | 1 | 4 | 13 |
| 2 | 2 | 1 | 1 | 3 | 0 | 0 | 1 | 2 | 1 | 1 | ... | 1 | 1 | 1 | 7 | 14 | 0 | 1 | 2 | 3 | 21 |
| 3 | 3 | 2 | 1 | 2 | 0 | 0 | 2 | 3 | 1 | 1 | ... | 1 | 1 | 1 | 0 | 4 | 0 | 0 | 3 | 2 | 4 |
| 4 | 4 | 0 | 1 | 2 | 2 | 0 | 0 | 2 | 0 | 0 | ... | 1 | 1 | 1 | 10 | 0 | 0 | 0 | 3 | 4 | 10 |
5 rows × 23 columns
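One caveat with calling `factorize()` on the test set by itself: codes are assigned in order of first appearance, so the same category can receive a different integer than it did in the training data. The sketch below, on toy frames with hypothetical values, shows one way to reuse the training mapping on the test set. The names `train`, `test` and `cat_cols` are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (hypothetical values)
train = pd.DataFrame({'country': ['ITALY', 'AUSTRALIA', 'ITALY', 'GERMANY']})
test = pd.DataFrame({'country': ['AUSTRALIA', 'GERMANY', 'KENYA']})

cat_cols = ['country']  # categorical columns to encode

for colname in cat_cols:
    # Learn the category -> code mapping from the training data only
    train[colname], categories = train[colname].factorize()
    # Reuse that mapping on the test data; unseen categories become -1
    test[colname] = pd.Categorical(test[colname], categories=categories).codes

print(train['country'].tolist())  # [0, 1, 0, 2]
print(test['country'].tolist())   # [1, 2, -1]
```

With this approach AUSTRALIA is encoded as 1 in both frames, whereas an independent `factorize()` on the test frame would have assigned it 0.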
model.fit(df[cols],target)
LinearRegression()
preds2 = model.predict(df_test)
preds2
array([ 2941140.7776178, 10778644.7776178, 19337428.7776178, ...,
13447932.7776178, 15230956.7776178, 15413460.7776178])
# Let's add the predicted total cost to the test dataset
Final_df = Final_df.assign(predicted_price=preds2)
Final_df.head()
| | ID | country | age_group | travel_with | total_female | total_male | purpose | main_activity | info_source | tour_arrangement | ... | package_transport_tz | package_sightseeing | package_guided_tour | package_insurance | night_mainland | night_zanzibar | payment_mode | first_trip_tz | most_impressing | predicted_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tour_1 | AUSTRALIA | 45-64 | Spouse | 1.0 | 1.0 | Leisure and Holidays | Wildlife tourism | Travel, agent, tour operator | Package Tour | ... | Yes | Yes | Yes | Yes | 10 | 3 | Cash | Yes | Wildlife | 2.941141e+06 |
| 1 | tour_100 | SOUTH AFRICA | 25-44 | Friends/Relatives | 0.0 | 4.0 | Business | Wildlife tourism | Tanzania Mission Abroad | Package Tour | ... | No | No | No | No | 13 | 0 | Cash | No | Wonderful Country, Landscape, Nature | 1.077864e+07 |
| 2 | tour_1001 | GERMANY | 25-44 | Friends/Relatives | 3.0 | 0.0 | Leisure and Holidays | Beach tourism | Friends, relatives | Independent | ... | No | No | No | No | 7 | 14 | Cash | No | No comments | 1.933743e+07 |
| 3 | tour_1006 | CANADA | 24-Jan | Friends/Relatives | 2.0 | 0.0 | Leisure and Holidays | Cultural tourism | others | Independent | ... | No | No | No | No | 0 | 4 | Cash | Yes | Friendly People | 1.579848e+07 |
| 4 | tour_1009 | UNITED KINGDOM | 45-64 | Friends/Relatives | 2.0 | 2.0 | Leisure and Holidays | Wildlife tourism | Friends, relatives | Package Tour | ... | Yes | No | No | No | 10 | 0 | Cash | Yes | Friendly People | 8.607109e+06 |
5 rows × 23 columns
# Visualize the residuals (predicted minus actual) against the predictions
plt.figure(figsize=(15, 7))
plt.scatter(y_train_pred, y_train_pred - y_train,
            c='black', marker='o', s=50, alpha=0.5,
            label='Train data')
plt.scatter(y_test_pred, y_test_pred - y_test,
            c='c', marker='o', s=50, alpha=0.7,
            label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper right')
plt.show()
The visualization above shows a satisfactory result.
The most profitable tourism activities in Tanzania are “Wildlife tourism”, followed by “Beach tourism”, so it would be worthwhile for investors to focus on them. Tourists travelling with friends or relatives spent the most, followed by those travelling with a spouse and children, so there is a need to develop facilities that suit these groups. Tourists below 65 years old spend more, so it is worthwhile to encourage this age group to visit Tanzania. The most profitable countries of origin include the USA, United Kingdom, Italy, France and Australia.